Crate packed_simd_2
Portable packed SIMD vectors
This crate is proposed for stabilization as std::packed_simd in RFC2366: std::simd.
The examples available in the examples/ sub-directory of the crate showcase how to use the library in practice.
Introduction
This crate exports Simd<[T; N]>: a packed vector of N elements of type T, as well as many type aliases for this type: for example, f32x4, which is just an alias for Simd<[f32; 4]>.
The operations on packed vectors are, by default, “vertical”, that is, they are applied to each vector lane in isolation from the others:
use packed_simd_2::*;

let a = i32x4::new(1, 2, 3, 4);
let b = i32x4::new(5, 6, 7, 8);
assert_eq!(a + b, i32x4::new(6, 8, 10, 12));
Many “horizontal” operations are also provided:
assert_eq!(a.wrapping_sum(), 10);
In virtually all architectures vertical operations are fast, while horizontal operations are, by comparison, much slower. Therefore, the most portably-efficient way of performing a reduction over a slice is to accumulate partial results into a vector using vertical operations and to perform a single horizontal operation at the end:
fn reduce(x: &[i32]) -> i32 {
    assert_eq!(x.len() % 4, 0);
    let mut sum = i32x4::splat(0); // [0, 0, 0, 0]
    for i in (0..x.len()).step_by(4) {
        sum += i32x4::from_slice_unaligned(&x[i..]);
    }
    sum.wrapping_sum()
}
let x = [0, 1, 2, 3, 4, 5, 6, 7];
assert_eq!(reduce(&x), 28);
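Real inputs are often not an exact multiple of the lane count. The following is a minimal sketch of the same approach that handles the remainder with scalar code (reduce_any_len is a hypothetical helper, not part of the crate, and use packed_simd_2::* is assumed):

use packed_simd_2::*;

fn reduce_any_len(x: &[i32]) -> i32 {
    let mut chunks = x.chunks_exact(4);
    let mut sum = i32x4::splat(0);
    for chunk in &mut chunks {
        // Vertical adds over the full 4-lane chunks.
        sum += i32x4::from_slice_unaligned(chunk);
    }
    // One horizontal reduction, plus the 0..=3 leftover elements.
    let tail: i32 = chunks.remainder().iter().sum();
    sum.wrapping_sum().wrapping_add(tail)
}

assert_eq!(reduce_any_len(&[0, 1, 2, 3, 4, 5]), 15);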
Vector types
The vector type aliases are named according to the following scheme:
{element_type}x{number_of_lanes} == Simd<[element_type; number_of_lanes]>
where the following element types are supported:
- i{element_width}: signed integer
- u{element_width}: unsigned integer
- f{element_width}: float
- m{element_width}: mask (see below)
- *{const,mut} T: const and mut pointers
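A few of these aliases in use (a minimal sketch, assuming use packed_simd_2::*; every alias is just an instance of Simd<[T; N]>):

use packed_simd_2::*;

let f: f32x4 = f32x4::splat(1.0);   // f{element_width}x{lanes}: four f32 lanes
let u: u8x16 = u8x16::splat(0);     // u{element_width}x{lanes}: sixteen u8 lanes
let m: m32x4 = m32x4::splat(true);  // m{element_width}x{lanes}: four mask lanes
assert_eq!(f.extract(0), 1.0);
assert_eq!(u.extract(15), 0);
assert!(m.all());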
Basic operations
// Sets all elements to `0`:
let a = i32x4::splat(0);
// Reads a vector from a slice:
let mut arr = [0, 0, 0, 1, 2, 3, 4, 5];
let b = i32x4::from_slice_unaligned(&arr);
// Reads the 4th element of a vector:
assert_eq!(b.extract(3), 1);
// Returns a new vector where the 4th element is replaced with `1`:
let a = a.replace(3, 1);
assert_eq!(a, b);
// Writes a vector to a slice:
let a = a.replace(2, 1);
a.write_to_slice_unaligned(&mut arr[4..]);
assert_eq!(arr, [0, 0, 0, 1, 0, 0, 1, 1]);
Conditional operations
One often needs to perform an operation on some lanes of the vector. Vector masks, like m32x4, allow selecting on which vector lanes an operation is to be performed:
let a = i32x4::new(1, 1, 2, 2);
// Add `1` to the first two lanes of the vector.
let m = m16x4::new(true, true, false, false);
let a = m.select(a + 1, a);
assert_eq!(a, i32x4::splat(2));
The elements of a vector mask are either true or false. Here true means that a lane is “selected”, while false means that a lane is not selected.
All vector masks implement a mask.select(a: T, b: T) -> T method that works on all vectors that have the same number of lanes as the mask. The resulting vector contains the elements of a for those lanes for which the mask is true, and the elements of b otherwise.
The example constructs a mask with the first two lanes set to true and the last two lanes set to false. This selects the first two lanes of a + 1 and the last two lanes of a, producing a vector where the first two lanes have been incremented by 1.
note: mask select can be used on vector types that have the same number of lanes as the mask. The example shows this by using m16x4 instead of m32x4. It is typically more performant to use a mask element width equal to the element width of the vectors being operated upon. This is, however, not true for 512-bit wide vectors when targeting AVX-512, where the most efficient masks use only 1 bit per element.
All vertical comparison operations return masks:
let a = i32x4::new(1, 1, 3, 3);
let b = i32x4::new(2, 2, 0, 0);
// ge: >= (greater than or equal to; see also lt, le, gt, eq, ne).
let m = a.ge(i32x4::splat(2));
if m.any() {
    // all / any / none allow coherent control flow
    let d = m.select(a, b);
    assert_eq!(d, i32x4::new(2, 2, 3, 3));
}
Conversions
- lossless widening conversions: From/Into are implemented for vectors with the same number of lanes when the conversion is value preserving (same as in std).
- safe bitwise conversions: the cargo feature into_bits provides the IntoBits/FromBits traits (x.into_bits()). These perform safe bitwise transmutes when all bit patterns of the source type are valid bit patterns of the target type, and are also implemented for the architecture-specific vector types of std::arch. For example, let x: u8x8 = m8x8::splat(true).into_bits(); is provided because all m8x8 bit patterns are valid u8x8 bit patterns. However, the opposite is not true: not all u8x8 bit patterns are valid m8x8 bit patterns, so this operation cannot be performed safely using x.into_bits(); one needs to use unsafe { crate::mem::transmute(x) } for that, making sure that the value in the u8x8 is a valid bit pattern of m8x8.
- numeric casts (as): these are performed using FromCast/Cast (x.cast()), just like as:
  - casting integer vectors whose lane types have the same size (e.g. i32xN -> u32xN) is a no-op,
  - casting from a larger integer to a smaller integer (e.g. u32xN -> u8xN) will truncate,
  - casting from a smaller integer to a larger integer (e.g. u8xN -> u32xN) will:
    - zero-extend if the source is unsigned, or
    - sign-extend if the source is signed,
  - casting from a float to an integer will round the float towards zero,
  - casting from an integer to a float will produce the floating point representation of the integer, rounding to nearest, ties to even,
  - casting from an f32 to an f64 is perfect and lossless,
  - casting from an f64 to an f32 rounds to nearest, ties to even.

  Numeric casts are not very “precise”: sometimes lossy, sometimes value preserving, etc.
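To make the first and third flavors concrete, here is a minimal sketch (assuming use packed_simd_2::*; the into_bits() line additionally requires the into_bits cargo feature, so it is left as a comment):

use packed_simd_2::*;

let a = i8x4::new(-1, 2, 3, 4);

// Lossless widening via `From`/`Into`: every `i8` value fits in an `i16`.
let w: i16x4 = a.into();
assert_eq!(w, i16x4::new(-1, 2, 3, 4));

// Numeric cast via `FromCast`/`Cast`: behaves like `as` on each lane,
// so `-1_i8` becomes `255_u8`.
let c: u8x4 = a.cast();
assert_eq!(c, u8x4::new(255, 2, 3, 4));

// With the `into_bits` cargo feature enabled, a safe bitwise conversion
// such as the following would also be available:
// let x: u8x8 = m8x8::splat(true).into_bits();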
Hardware Features
This crate can use different hardware features based on your configured RUSTFLAGS. For example, with no configured RUSTFLAGS, u64x8 on x86_64 will use SSE2 operations like PCMPEQD. If you configure RUSTFLAGS='-C target-feature=+avx2,+avx' on supported x86_64 hardware, the same u64x8 may use wider AVX2 operations like VPCMPEQQ. It is important for performance and for hardware support requirements that you choose an appropriate set of target-feature and target-cpu options during builds. For more information, see the Performance guide.
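As a quick sanity check (a sketch that uses only the standard library's cfg! macro, not an API of this crate), you can print which target features a given build was compiled with:

fn main() {
    // `cfg!(target_feature = "...")` is evaluated at compile time, so the
    // output reflects the RUSTFLAGS used for this particular build.
    println!("sse2: {}", cfg!(target_feature = "sse2"));
    println!("avx:  {}", cfg!(target_feature = "avx"));
    println!("avx2: {}", cfg!(target_feature = "avx2"));
}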
Macros
Shuffles vector elements.
Structs
Wrapper over T implementing a lexicographical order via the PartialOrd and/or Ord traits.
Packed SIMD vector type.
8-bit wide mask.
16-bit wide mask.
32-bit wide mask.
64-bit wide mask.
128-bit wide mask.
isize-wide mask.
Traits
Numeric cast from Self to T.
Numeric cast from T to Self.
This trait is implemented by all mask types.
Trait implemented by arrays that can be SIMD types.
This trait is implemented by all SIMD vector types.
Type Definitions
A vector with 2 *const T lanes.
A vector with 4 *const T lanes.
A vector with 8 *const T lanes.
A 64-bit vector with 2 f32 lanes.
A 128-bit vector with 4 f32 lanes.
A 256-bit vector with 8 f32 lanes.
A 512-bit vector with 16 f32 lanes.
A 128-bit vector with 2 f64 lanes.
A 256-bit vector with 4 f64 lanes.
A 512-bit vector with 8 f64 lanes.
A 16-bit vector with 2 i8 lanes.
A 32-bit vector with 4 i8 lanes.
A 64-bit vector with 8 i8 lanes.
A 128-bit vector with 16 i8 lanes.
A 256-bit vector with 32 i8 lanes.
A 512-bit vector with 64 i8 lanes.
A 32-bit vector with 2 i16 lanes.
A 64-bit vector with 4 i16 lanes.
A 128-bit vector with 8 i16 lanes.
A 256-bit vector with 16 i16 lanes.
A 512-bit vector with 32 i16 lanes.
A 64-bit vector with 2 i32 lanes.
A 128-bit vector with 4 i32 lanes.
A 256-bit vector with 8 i32 lanes.
A 512-bit vector with 16 i32 lanes.
A 128-bit vector with 2 i64 lanes.
A 256-bit vector with 4 i64 lanes.
A 512-bit vector with 8 i64 lanes.
A 128-bit vector with 1 i128 lane.
A 256-bit vector with 2 i128 lanes.
A 512-bit vector with 4 i128 lanes.
A vector with 2 isize lanes.
A vector with 4 isize lanes.
A vector with 8 isize lanes.
A 16-bit vector mask with 2 m8 lanes.
A 32-bit vector mask with 4 m8 lanes.
A 64-bit vector mask with 8 m8 lanes.
A 128-bit vector mask with 16 m8 lanes.
A 256-bit vector mask with 32 m8 lanes.
A 512-bit vector mask with 64 m8 lanes.
A 32-bit vector mask with 2 m16 lanes.
A 64-bit vector mask with 4 m16 lanes.
A 128-bit vector mask with 8 m16 lanes.
A 256-bit vector mask with 16 m16 lanes.
A 512-bit vector mask with 32 m16 lanes.
A 64-bit vector mask with 2 m32 lanes.
A 128-bit vector mask with 4 m32 lanes.
A 256-bit vector mask with 8 m32 lanes.
A 512-bit vector mask with 16 m32 lanes.
A 128-bit vector mask with 2 m64 lanes.
A 256-bit vector mask with 4 m64 lanes.
A 512-bit vector mask with 8 m64 lanes.
A 128-bit vector mask with 1 m128 lane.
A 256-bit vector mask with 2 m128 lanes.
A 512-bit vector mask with 4 m128 lanes.
A vector with 2 *mut T lanes.
A vector with 4 *mut T lanes.
A vector with 8 *mut T lanes.
A vector mask with 2 msize lanes.
A vector mask with 4 msize lanes.
A vector mask with 8 msize lanes.
A 16-bit vector with 2 u8 lanes.
A 32-bit vector with 4 u8 lanes.
A 64-bit vector with 8 u8 lanes.
A 128-bit vector with 16 u8 lanes.
A 256-bit vector with 32 u8 lanes.
A 512-bit vector with 64 u8 lanes.
A 32-bit vector with 2 u16 lanes.
A 64-bit vector with 4 u16 lanes.
A 128-bit vector with 8 u16 lanes.
A 256-bit vector with 16 u16 lanes.
A 512-bit vector with 32 u16 lanes.
A 64-bit vector with 2 u32 lanes.
A 128-bit vector with 4 u32 lanes.
A 256-bit vector with 8 u32 lanes.
A 512-bit vector with 16 u32 lanes.
A 128-bit vector with 2 u64 lanes.
A 256-bit vector with 4 u64 lanes.
A 512-bit vector with 8 u64 lanes.
A 128-bit vector with 1 u128 lane.
A 256-bit vector with 2 u128 lanes.
A 512-bit vector with 4 u128 lanes.
A vector with 2 usize lanes.
A vector with 4 usize lanes.
A vector with 8 usize lanes.